(adapted from J. Silge and D. Robinson's Text Mining with R: a Tidy Approach)
In this notebook, you will be introduced to the basics of tidy text mining using the tidytext R library, which shares syntax with H. Wickham's popular tidyverse suite of packages (including ggplot2, a powerful graphic library).
The main datasets that you will work with are a selection of Shakespeare and Marlowe plays from Project Gutenberg, and the AP Recaps data set from the TMNLP 1 notebook.
For this notebook, we do not aggregate all the library calls into a single call at the top. As an aside, the order of these calls can affect the analytical process, as some libraries mask other libraries' methods.
R is a functional language, which encourages nested function calls; the resulting layers of parentheses make code difficult to read. The pipeline operator %>% and the package dplyr can be used to remedy the situation.
Hadley Wickham provided an example in 2014 to illustrate how it works:
hourly_delay <- filter(
summarise(
group_by(
filter(
flights,
!is.na(dep_delay)
),
date, hour
),
delay = mean(dep_delay),
n = n()
),
n > 10
)
Take some time to figure out what is supposed to be happening here.
The pipeline operator eschews nesting function calls in favor of passing data from one function to the next:
hourly_delay <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(date, hour) %>%
summarise(
delay = mean(dep_delay),
n = n() ) %>%
filter(n > 10)
The beauty of this approach is that it can be 'read' aloud to discover what the block of code is meant to do.
The flights data frame is:
1. filtered (to remove missing values of the dep_delay variable),
2. grouped by hours within days,
3. the mean delay is calculated within each group, and
4. the mean delay is returned for those hours with more than n > 10 flights.
The pipeline rules are simple: the object on the left hand side is passed as the first argument to the function on the right hand side.
data %>% function is the same as function(data)
data %>% function(arg=value) is the same as function(data, arg=value)
References: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
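A minimal sketch of the first-argument rule, illustrated with base R's native |> pipe (available since R 4.1, which follows the same convention as %>%; the rest of this notebook uses %>%):

```r
# The left-hand side is passed as the first argument of the right-hand side.
x <- c(4, 9, 16)
nested <- sum(sqrt(x))          # nested call, read inside-out
piped  <- x |> sqrt() |> sum()  # same computation, read left to right
nested  # 9
piped   # 9
```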
In a Text Mining context, text typically comes in one of 4 formats: a raw character string, a corpus (raw strings annotated with metadata), a document-term matrix, or tidy text.
Tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
Tidy text is a table with one token (single word, $n$-gram, sentence, paragraph) per row. As we've seen before, text has to be tokenized into these commonly-used units.
Consider the following haiku by master Matsuo Basho:
haiku <- c('In the twilight rain',
'these brilliant-hued hibiscus -',
'A lovely sunset')
haiku
Let's turn it into a data frame:
library(dplyr)
haiku.df <- data.frame(text=haiku, stringsAsFactors = FALSE) # last parameter is important, we want to be able to separate the words
haiku.df
library(tidytext)
haiku.df %>% unnest_tokens(word,text) # unnest the tokens in haiku.df: the output column is word, the input column is text
unnest_tokens() separates the tokens (words, in this example), strips away the punctuation, converts them to lowercase, and records each token's line and order.
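To make that behaviour concrete, here is a rough base-R approximation of the word-tokenization step (tokenize_line is a hypothetical helper, not part of tidytext; the real unnest_tokens() also records which line each token came from):

```r
lines <- c("In the twilight rain,", "A lovely sunset!")
tokenize_line <- function(s) {
  # split on anything that is not a letter or apostrophe, then lowercase
  toks <- tolower(unlist(strsplit(s, "[^A-Za-z']+")))
  toks[toks != ""]  # drop empty strings left by leading punctuation
}
lapply(lines, tokenize_line)
# [[1]] "in" "the" "twilight" "rain"
# [[2]] "a" "lovely" "sunset"
```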
In general, we follow the same flow: import the raw text, tidy it into a one-token-per-row table with unnest_tokens(), and then summarize or visualize the result.
Let's illustrate the flow with some of Shakespeare's plays.
(Gutenberg Project ID - Romeo and Juliet: 1112, Hamlet: 1524, Macbeth: 2264, A Midsummer Night's Dream: 2242, etc.)
References: http://www.gutenberg.org/ebooks/search/?query=Shakespeare
library(gutenbergr)
will_shakespeare <-gutenberg_download(c(1112,1524,2264,2242,2267,1120,1128,2243,23042,1526,1107,2253,1121,1103,2240,2268,1535,1126,1539,23046,1106,2251,2250,1790,2246,1114,1108,2262,1109,1537))
Now we produce a tidy text dataset:
library(stringr) # necessary to use str_extract
tidy_ws <- will_shakespeare %>%
unnest_tokens(word,text) %>%
mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
anti_join(stop_words) %>% # removing the stop words in the tidytext dataset stop_words
na.omit() # remove NAs
head(tidy_ws)
For which we can easily produce a word count:
tidy_ws %>%
count(word, sort=TRUE)
library(ggplot2)
tidy_ws %>%
count(word, sort=TRUE) %>%
filter(n > 500) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab("Frequent words in selected Shakespeare plays") +
ylab("Word count") +
coord_flip()
kit_marlowe <-gutenberg_download(c(779,1094,901,20288,16169,1589,1496,18781))
tidy_km <- kit_marlowe %>%
unnest_tokens(word,text) %>%
mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
anti_join(stop_words) %>% # removing the stop words in the tidytext dataset stop_words
na.omit() # remove NAs
head(tidy_km)
We'll look at both of these datasets simultaneously. In order to do so, we'll build a word_count data set with the help of the pipeline operator. One of its advantages is that we can build the query sequentially and easily see the output at various stages.
We'll start by binding tidy_ws and tidy_km into a single dataset.
library(tidyr)
library(stringr) # for str_extract
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe"))
head(word_count)
tail(word_count)
Then we'll execute a word count for each of the authors (note the sorting of the outputs, and the new field $n$).
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word)
head(word_count)
tail(word_count)
To follow the tidy approach, we need word_count to have a unique value for each word for each author. In this case, grouping by author won't have an effect (why?), but let's add the line anyway for completeness' sake.
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author)
head(word_count)
tail(word_count)
The datasets differ in size, as we are using a larger number of Shakespeare plays. Rather than look at raw counts (which would naturally favour the Bard's output), we'll look at proportions: $$\frac{\mbox{number of occurrences of a specific term in the dataset}}{\mbox{total number of terms in the dataset}} =\frac{n}{\sum n}$$
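As a toy numeric illustration of this proportion (made-up counts for one author):

```r
# Hypothetical word counts:
counts <- c(thou = 5, king = 3, love = 2)
proportion <- counts / sum(counts)
proportion        # thou 0.5, king 0.3, love 0.2
sum(proportion)   # proportions always sum to 1
```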
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n))
head(word_count)
tail(word_count)
We can now remove the raw counts to focus on the proportions.
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
select(-c(n))
head(word_count)
tail(word_count)
We reshape the word_count dataset to facilitate the analysis: each word is now represented by a row, and the proportion of the time it appears in each author's writings is shown in the columns.
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
select(-c(n)) %>% # removes the count (keeps the proportion)
spread(author,proportion)
head(word_count,20)
tail(word_count)
Let's see what proportion of each author's output is not found in the other's:
(WS_nKM <- sum(word_count$WillShakespeare[is.na(word_count$KitMarlowe)])) # % of Shakespeare's output not in Marlowe
(KM_nWS <- sum(word_count$KitMarlowe[is.na(word_count$WillShakespeare)])) # % of Marlowe's output not in Shakespeare
Finally, we re-organize the table for use with ggplot() (strictly speaking, this step is not mandatory).
word_count <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_km,author="KitMarlowe")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
select(-c(n)) %>% # removes the count (keeps the proportion)
spread(author,proportion) %>% # reshapes the table in a tidy format
gather(author, proportion, `WillShakespeare`)
head(word_count)
tail(word_count)
library(scales)
ggplot(word_count, aes(x = proportion, y = `KitMarlowe`, color = abs(`KitMarlowe` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_color_gradient(limits = c(0, 0.001), low = "red", high = "gray75") +
theme(legend.position="none") +
labs(y = "Kit Marlowe", x = "Will Shakespeare")
Words near the straight line are used with roughly the same frequency by both authors. For instance: lord, king, thou in the high-frequency spectrum and beasts, estate and glory in the low-frequency spectrum.
Words away from the straight line are used more frequently by one of the authors: lady and caesar seem to be used relatively more often by Shakespeare than by Marlowe, and aeneas and doctor are in the opposite situation (these terms are specific to plays).
The colour is related to the distance between the relative frequencies of a term for each author (red is close, gray is far). What could explain the shape of the red cloud (large at the bottom, thin at the top)?
Note the presence of both beasts and beast -- what does that tell you about the texts? And are you surprised by the prevalence of terms like enter, exit and exeunt?
Finally, let's see if we can quantify the similarity in word usage.
cor.test(data = word_count, ~ proportion + `KitMarlowe`)
There's a fairly strong correlation (0.847077) between the relative term frequencies for the two wordsmiths (among those terms which are found in both text outputs -- see KM_nWS and WS_nKM). That's not entirely unexpected, since they were contemporaries: one would naively predict that the depth of their vocabulary and the way they deployed it would be linked, to some extent.
But without comparisons to other texts, it's hard to really put this value in perspective.
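As a reminder of what cor.test() measures, here is Pearson's correlation on a toy set of paired relative frequencies (made-up numbers, not drawn from the plays):

```r
# Made-up relative frequencies for four words shared by two authors:
x <- c(0.010, 0.008, 0.002, 0.001)
y <- c(0.009, 0.007, 0.003, 0.002)
cor(x, y)      # close to 1: similar usage patterns
cor.test(x, y) # also reports a confidence interval and p-value
```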
# Import text data
recaps <- read.csv(file="Data/Recap_data.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP_Recap
# Cast the data in a data frame
recaps.df <- data.frame(text=AP.recaps, stringsAsFactors = FALSE)
# Create a tidytext dataset
tidy_AP <- recaps.df %>%
unnest_tokens(word,text) %>%
mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
anti_join(stop_words) %>% # removing the stop words in the tidytext dataset stop_words
na.omit() # remove NAs
head(tidy_AP) # inspect
As demonstrated above, the tidy structure can easily be meshed with ggplot2 to produce insightful graphics.
word_count_2 <- bind_rows(mutate(tidy_ws,author="WillShakespeare"),mutate(tidy_AP,author="AP_recaps")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
select(-c(n)) %>% # removes the count (keeps the proportion)
spread(author,proportion) %>% # reshapes the table in a tidy format
gather(author, proportion, `WillShakespeare`)
word_count_3 <- bind_rows(mutate(tidy_km,author="KitMarlowe"),mutate(tidy_AP,author="AP_recaps")) %>% # create a new variable identifying the author
count(author,word) %>% # count the word for each author's work
group_by(author) %>% # provide the output for each author
mutate(proportion = n / sum(n)) %>% # rather than raw counts, look at proportion of usage
select(-c(n)) %>% # removes the count (keeps the proportion)
spread(author,proportion) %>% # reshapes the table in a tidy format
gather(author, proportion, `KitMarlowe`)
ggplot(word_count_2, aes(x = proportion, y = `AP_recaps`, color = abs(`AP_recaps` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_color_gradient(limits = c(0, 0.001), low = "blue", high = "gray75") +
theme(legend.position="none") +
labs(y = "AP Recaps", x = "Will Shakespeare")
ggplot(word_count_3, aes(x = proportion, y = `AP_recaps`, color = abs(`AP_recaps` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_y_log10(labels = percent_format(),limits=c(0.0001,0.02)) +
scale_color_gradient(limits = c(0, 0.001), low = "green", high = "gray75") +
theme(legend.position="none") +
labs(y = "AP Recaps", x = "Kit Marlowe")
The correlation computations can be done as above.
(WS_nAP <- sum(word_count_2$proportion[is.na(word_count_2$AP_recaps)])) # % of Shakespeare's output not in AP Recaps
(AP_nWS <- sum(word_count_2$AP_recaps[is.na(word_count_2$proportion)])) # % of AP Recaps's output not in Shakespeare
(KM_nAP <- sum(word_count_3$proportion[is.na(word_count_3$AP_recaps)])) # % of Marlowe's output not in AP Recaps
(AP_nKM <- sum(word_count_3$AP_recaps[is.na(word_count_3$proportion)])) # % of AP Recaps's output not in Marlowe
cor.test(data = word_count_2, ~ proportion + `AP_recaps`)
cor.test(data = word_count_3, ~ proportion + `AP_recaps`)
We have been using word, term, token, and unit interchangeably when analyzing text up to now, as befits the Bag of "Words" approach.
It's not too hard to think of applications where the basic numerical unit is not the relative frequency of single words, however, but the links between 2 or more words, in succession or in co-occurrence.
Rather than tokenize some text by words, we can tokenize it by series of $n$ consecutive words (or $n$-grams).
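Before handing this over to unnest_tokens(), it helps to see how bigrams are built from a word sequence; a base-R sketch:

```r
words <- c("to", "be", "or", "not", "to", "be")
# pair each word with its successor: drop the last word on the left,
# the first word on the right, and paste elementwise
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams         # "to be" "be or" "or not" "not to" "to be"
table(bigrams)  # "to be" occurs twice
```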
Are there interesting bigrams in Shakespeare's plays? What would you expect the most common bigrams to be?
library(stringr) # necessary to use str_extract
tidy_ws.2 <- will_shakespeare %>%
unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
#mutate(bigram = str_extract(bigram,"[0-9a-zA-Z'\ ]+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
count(bigram,sort=TRUE) # produce a count and sort on decreasing frequency
tidy_ws.2
At first glance, among the top 10 most frequent bigrams, only one conveys even a sliver of information: my lord. Everything else is stopword material.
However, what about the 9th most frequent bigram? In a general context, to be is a "stopword" -- but there are at least a few specific instances in this context where that bigram is emphatically not just a "stopword".
Removing bigram stopwords is simple, although not as straightforward as in the unigram case:
For the sake of this exercise, let's also remove words related to the printing business, and theatre terms.
library(tidyr) #
# we will append a number of words to the stop_words dataset
word = c("gutenberg","shakespeare"," ","etext","1990","1993","public","print","copies","membership"
,"commercial","commercially","electronic","download","distribution"
,"ff","f1","f2","f3","f4","NA","collier","ms","cap","txt","zip"
,"library","printed", "text","editions"
,"executive", "pobox", "fees", "million", "ascii", "legal", "61825", "2782"
,"director", "machine","readable","carnegie","mellon","university"
#,"exit", "exeunt", "enter", "scene", "act", "folio", "dramatis"
#,"mine","tis", "thine","thy", "thou","art","hast", "shalt","dost","thee"
#,"act_4","act_1","act_2","act_3","act_5","sc_1","sc_2","sc_3","sc_4","sc_5"
#,"sc_6","sc_7","sc_8","sc_9","sc_10","sc_11"
)
lexicon = rep("modern",length(word)) # let's call it the modern lexicon
addition = data.frame(word,lexicon)
stop_words_ws = rbind(stop_words,addition)
tidy_ws.2_cleaned <- tidy_ws.2 %>%
separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
filter(!FirstTerm %in% stop_words_ws$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words
filter(!SecondTerm %in% stop_words_ws$word) %>% # or rather, for which it is false that the 1st/2nd term is in stop_words_ws
unite(bigram,FirstTerm,SecondTerm, sep=" ")
tidy_ws.2_cleaned
We can do a count by book too!
library(stringr) # necessary to use str_extract
tidy_ws.2.by_book <- will_shakespeare %>%
unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
count(gutenberg_id,bigram,sort=TRUE) %>% # produce a count and sort on decreasing frequency
separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
filter(!FirstTerm %in% stop_words$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words
filter(!SecondTerm %in% stop_words$word) %>% # or rather, for which it is false that the 1st/2nd term is in stop_words
unite(bigram,FirstTerm,SecondTerm, sep=" ") # re-unite the bigrams
head(tidy_ws.2.by_book)
Can we find the most common Nobles in each book? Yes. We. Can.
tidy_ws.2.by_book_nobles <- will_shakespeare %>%
unnest_tokens(bigram,text,token="ngrams",n=2) %>% # tokenize on bigrams
count(gutenberg_id,bigram,sort=TRUE) %>% # produce a count and sort on decreasing frequency
separate(bigram, c("FirstTerm","SecondTerm"), sep=" ") %>% # separate the bigrams on the space character
filter(!FirstTerm %in% stop_words$word) %>% # only retain those rows for which 1st/2nd term is not in stop_words
filter(!SecondTerm %in% stop_words$word) %>% # or rather, for which it is false that the 1st/2nd term is in stop_words
filter(FirstTerm %in% c("lord","queen","sir","duke","king")) %>% # which nobles are commonly mentioned in the plays?
unite(bigram,FirstTerm,SecondTerm, sep=" ") # re-unite the bigrams
tidy_ws.2.by_book_nobles
We can also compute the tf-idf of bigrams in Shakespeare's plays, giving us terms that have document-specific importance in the corpus.
ws_tf_idf <- tidy_ws.2.by_book %>% # already contains one row per (book, bigram), with count n
bind_tf_idf(bigram,gutenberg_id,n) %>% # compute the tf-idf using bigrams as tokens
arrange(desc(tf_idf)) # sort by highest tf_idf
ws_tf_idf
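To see what bind_tf_idf() computes for a single term, here is the calculation worked out by hand with made-up counts (tidytext uses the natural logarithm for the idf):

```r
# Made-up counts for one term in one document of a 10-document corpus:
tf  <- 3 / 100       # term occurs 3 times among 100 tokens in this document
idf <- log(10 / 2)   # term appears in 2 of the 10 documents
tf_idf <- tf * idf
tf_idf               # about 0.048
```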
Silge and Robinson provide functions to count and visualize bigram networks:
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)
library(igraph)
library(ggraph)
count_bigrams_ws <- function(dataset) {
word = c("gutenberg","shakespeare"," ","etext","1990","1993","public","print","copies","membership"
,"commercial","commercially","electronic","download","distribution"
,"ff","f1","f2","f3","f4","NA","collier","ms","cap","txt","zip"
,"library","printed", "text","editions"
,"executive", "pobox", "fees", "million", "ascii", "legal", "61825", "2782"
,"director", "machine","readable","carnegie","mellon","university"
,"exit", "exeunt", "enter", "scene", "act", "folio", "dramatis"
#,"mine","tis", "thine","thy", "thou","art","hast", "shalt","dost","thee"
#,"act_4","act_1","act_2","act_3","act_5","sc_1","sc_2","sc_3","sc_4","sc_5"
#,"sc_6","sc_7","sc_8","sc_9","sc_10","sc_11"
)
lexicon = rep("modern",length(word)) # let's call it the modern lexicon
addition = data.frame(word,lexicon)
stop_words_ws = rbind(stop_words,addition)
dataset %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("FirstTerm", "SecondTerm"), sep = " ") %>%
filter(!FirstTerm %in% stop_words_ws$word,
!SecondTerm %in% stop_words_ws$word) %>%
count(FirstTerm, SecondTerm, sort = TRUE)
}
visualize_bigrams <- function(bigrams) {
set.seed(2016)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
bigrams %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
}
ws_bigrams = will_shakespeare %>% count_bigrams_ws() # count the bigrams with the ws stopwords
ws_bigrams %>% filter(n>40,
!str_detect(FirstTerm,"\\d"),
!str_detect(SecondTerm,"\\d")) %>%
visualize_bigrams()